ON THE ADAPTIVE ELASTIC-NET WITH A DIVERGING 
NUMBER OF PARAMETERS 

By Hui Zou and Hao Helen Zhang 
University of Minnesota and North Carolina State University 

We consider the problem of model selection and estimation in 
situations where the number of parameters diverges with the sample 
size. When the dimension is high, an ideal method should have the 
oracle property [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] and 
[Ann. Statist. 32 (2004) 928-961] which ensures the optimal large 
sample performance. Furthermore, the high-dimensionality often induces 
the collinearity problem, which should be properly handled 
by the ideal method. Many existing variable selection methods fail 
to achieve both goals simultaneously. In this paper, we propose the 
adaptive elastic-net that combines the strengths of the quadratic reg- 
ularization and the adaptively weighted lasso shrinkage. Under weak 
regularity conditions, we establish the oracle property of the adap- 
tive elastic-net. We show by simulations that the adaptive elastic-net 
deals with the collinearity problem better than the other oracle-like 
methods, thus enjoying much improved finite sample performance. 

1. Introduction. 

1.1. Background. Consider the problem of model selection and estima- 
tion in the classical linear regression model 

(1.1) $y = X\beta^* + \varepsilon$,

where $y = (y_1, \ldots, y_n)^T$ is the response vector and
$x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, are the linearly
independent predictors. Let $X = [x_1, \ldots, x_p]$ be the predictor matrix.
Without loss of generality, we assume the data are centered, so the intercept
is not included in the regression function. Throughout this paper, we assume
the errors are independent and identically distributed with zero mean and
finite variance $\sigma^2$. We are interested in the sparse modeling problem
where the true model has a sparse representation (i.e., some components of
$\beta^*$ are exactly zero). Let $\mathcal{A} = \{j : \beta_j^* \neq 0, j = 1, 2, \ldots, p\}$.
In this work, we call the size of $\mathcal{A}$ the intrinsic dimension of the
underlying model. We wish to discover the set $\mathcal{A}$ and estimate the
corresponding coefficients.

Variable selection is fundamentally important for knowledge discovery 
with high-dimensional data [Fan and Li (2006)] and it could greatly enhance 
the prediction performance of the fitted model. Traditional model selection 
procedures follow best-subset selection and its step-wise variants. However, 
best-subset selection is computationally prohibitive when the number of pre- 
dictors is large. Furthermore, as analyzed by Breiman (1996), subset selec- 
tion is unstable; thus, the resulting model has poor prediction accuracy. 
To overcome the fundamental drawbacks of subset selection, statisticians 
have recently proposed various penalization methods to perform simulta- 
neous model selection and estimation. In particular, the lasso [Tibshirani 
(1996)] and the SCAD [Fan and Li (2001)] are two very popular meth- 
ods due to their good computational and statistical properties. Efron et 
al. (2004) proposed the LARS algorithm for computing the entire lasso so- 
lution path. Knight and Fu (2000) studied the asymptotic properties of the 
lasso. Fan and Li (2001) showed that the SCAD enjoys the oracle property, 
that is, the SCAD estimator can perform as well as the oracle if the penal- 
ization parameter is appropriately chosen. 

1.2. Two fundamental issues with the $\ell_1$ penalty. The lasso estimator
[Tibshirani (1996)] is obtained by solving the $\ell_1$ penalized least squares
problem

(1.2) $\hat\beta(\mathrm{lasso}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$,

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the $\ell_1$-norm of $\beta$. The
$\ell_1$ penalty enables the lasso to simultaneously regularize the least
squares fit and shrink some components of $\hat\beta(\mathrm{lasso})$ to zero
for some appropriately chosen $\lambda$. The entire lasso solution paths can be
computed by the LARS algorithm [Efron et al. (2004)]. These nice properties
make the lasso a very popular variable selection method.

Despite its popularity, the lasso does have two serious drawbacks: namely, 
the lack of oracle property and instability with high-dimensional data. First 
of all, the lasso does not have the oracle property. Fan and Li (2001) first 
pointed out that asymptotically the lasso has nonignorable bias for estimat- 
ing the nonzero coefficients. They further conjectured that the lasso may 
not have the oracle property because of the bias problem. This conjecture 
was recently proven in Zou (2006). Zou (2006) further showed that the lasso 
could be inconsistent for model selection unless the predictor matrix (or the 
design matrix) satisfies a rather strong condition. Zou (2006) proposed the 
following adaptive lasso estimator 

(1.3) $\hat\beta(\mathrm{AdaLasso}) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \hat w_j |\beta_j|$,

where $\{\hat w_j\}_{j=1}^p$ are the adaptive data-driven weights and can be
computed by $\hat w_j = (|\hat\beta_j^{\mathrm{ini}}|)^{-\gamma}$, where $\gamma$
is a positive constant and $\hat\beta^{\mathrm{ini}}$ is an initial root-n
consistent estimate of $\beta$. Zou (2006) showed that, with an appropriately
chosen $\lambda$, the adaptive lasso performs as well as the oracle. Candes,
Wakin and Boyd (2008) used the adaptive lasso idea to enhance sparsity in
sparse signal recovery via the reweighted $\ell_1$ minimization.

Secondly, the $\ell_1$ penalization methods can have very poor performance 
when there are highly correlated variables in the predictor set. The collinear- 
ity problem is often encountered in high-dimensional data analysis. Even 
when the predictors are independent, as long as the dimension is high, the 
maximum sample correlation can be large, as shown in Fan and Lv (2008). 
Collinearity can severely degrade the performance of the lasso. As shown in 
Zou and Hastie (2005), the lasso solution paths are unstable when predic- 
tors are highly correlated. Zou and Hastie (2005) proposed the elastic-net as 
an improved version of the lasso for analyzing high-dimensional data. The 
elastic-net estimator is defined as follows: 

(1.4) $\hat\beta(\mathrm{enet}) = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1 \right\}$.

If the predictors are standardized (each variable has mean zero and $L_2$-norm
one), then we should change $(1 + \lambda_2/n)$ to $(1 + \lambda_2)$ as in Zou
and Hastie (2005). The $\ell_1$ part of the elastic-net performs automatic
variable selection, while the $\ell_2$ part stabilizes the solution paths and,
hence, improves the prediction. In an orthogonal design where the lasso is
shown to be optimal [Donoho et al. (1995)], the elastic-net automatically
reduces to the lasso. However, when the 
correlations among the predictors become high, the elastic-net can signifi- 
cantly improve the prediction accuracy of the lasso. 
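
The reduction to the lasso in an orthogonal design can be checked by hand. The following is a sketch under an assumption not stated in the text, namely that the orthogonal design is scaled so that $X^TX = nI$; the symbols $z_j = x_j^Ty$ and the soft-thresholding operator $S(z,t) = \mathrm{sign}(z)(|z| - t)_+$ are introduced here only for illustration. The criterion in (1.4) then separates across coordinates:

```latex
\begin{aligned}
\|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1
  &= \|y\|_2^2 + \sum_{j=1}^p \bigl\{ (n+\lambda_2)\beta_j^2 - 2 z_j \beta_j + \lambda_1 |\beta_j| \bigr\}, \\
\arg\min_{\beta_j} \bigl\{ (n+\lambda_2)\beta_j^2 - 2 z_j \beta_j + \lambda_1 |\beta_j| \bigr\}
  &= \frac{S(z_j, \lambda_1/2)}{n + \lambda_2}, \\
\hat\beta(\mathrm{enet})_j
  = \Bigl(1 + \frac{\lambda_2}{n}\Bigr) \frac{S(z_j, \lambda_1/2)}{n + \lambda_2}
  &= \frac{S(z_j, \lambda_1/2)}{n}
  = \hat\beta(\mathrm{lasso})_j .
\end{aligned}
```

Under this scaling the factor $(1 + \lambda_2/n)$ exactly undoes the extra ridge shrinkage, so the value of $\lambda_2$ drops out and the estimate coincides with the lasso solution for $\lambda = \lambda_1$.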

1.3. The adaptive elastic-net. The adaptively weighted $\ell_1$ penalty and 
the elastic-net penalty improve the lasso in two different directions. The 
adaptive lasso achieves the oracle property of the SCAD and the elastic-net 
handles the collinearity. However, following the arguments in Zou and Hastie 
(2005) and Zou (2006), we can easily see that the adaptive lasso inherits the 
instability of the lasso for high-dimensional data, while the elastic-net lacks 
the oracle property. Thus, it is natural to consider combining the ideas of the 
adaptively weighted $\ell_1$ penalty and the elastic-net regularization to obtain 
a better method that can improve the lasso in both directions. To this end, 
we propose the adaptive elastic-net that penalizes the squared error loss 
using a combination of the $\ell_2$ penalty and the adaptive $\ell_1$ penalty. Since the 
adaptive elastic-net is designed for high-dimensional data analysis, we study 
its asymptotic properties under the assumption that the dimension diverges 
with the sample size. 

Pioneering papers on asymptotic theories with diverging number of pa- 
rameters include [Huber (1988) and Portnoy (1984)] which studied the M- 
estimators. Recently, Fan, Peng and Huang (2005) studied a semi-parametric 
model with a growing number of nuisance parameters, whereas Lam and Fan 
(2008) investigated the profile likelihood ratio inference for the growing num- 
ber of parameters. In particular, our work is influenced by Fan and Peng 
(2004) who studied the oracle property of nonconcave penalized likelihood 
estimators. Fan and Peng (2004) provocatively argued that it is important 
to study the validity of the oracle property when the dimension diverges. We 
would like to know whether the adaptive elastic-net enjoys the oracle prop- 
erty with a diverging number of predictors. This question will be thoroughly 
investigated in this paper. 

The rest of the paper is organized as follows. In Section 2, we introduce 
the adaptive elastic-net. Statistical theory, including the oracle property, of 
the adaptive elastic-net is established in Section 3. In Section 4, we use sim- 
ulation to compare the finite sample performance of the adaptive elastic-net 
with the SCAD and other competitors. Section 5 discusses how to com- 
bine SIS of Fan and Lv (2008) and the adaptive elastic-net to deal with the 
ultra-high dimension cases. Technical proofs are presented in Section 6. 

2. Method. The adaptive elastic-net can be viewed as a combination of
the elastic-net and the adaptive lasso. Suppose we first compute the
elastic-net estimator $\hat\beta(\mathrm{enet})$ as defined in (1.4), and then
we construct the adaptive weights by

(2.1) $\hat w_j = (|\hat\beta_j(\mathrm{enet})|)^{-\gamma}$, $j = 1, 2, \ldots, p$,

where $\gamma$ is a positive constant. Now we solve the following optimization
problem to get the adaptive elastic-net estimates

(2.2) $\hat\beta(\mathrm{AdaEnet}) = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^p \hat w_j |\beta_j| \right\}$.

From now on, we write $\hat\beta = \hat\beta(\mathrm{AdaEnet})$ for the sake of convenience.

If we force $\lambda_2$ to be zero in (2.2), then the adaptive elastic-net reduces to
the adaptive lasso. Following the arguments in Zou and Hastie (2005), we
can easily show that in an orthogonal design the adaptive elastic-net reduces
to the adaptive lasso, regardless of the value of $\lambda_2$. This is desirable because,
in that setting, the adaptive lasso achieves the optimal minimax risk bound
[Zou (2006)]. The role of the $\ell_2$ penalty in (2.2) is to further regularize the
adaptive lasso fit whenever the collinearity may cause serious trouble.

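We know the elastic-net naturally adopts a sparse representation. One
can use $\hat w_j = (|\hat\beta_j(\mathrm{enet})| + 1/n)^{-\gamma}$ to avoid
division by zero. We can also define $\hat w_j = \infty$ when
$\hat\beta_j(\mathrm{enet}) = 0$. Let
$\hat{\mathcal{A}}_{\mathrm{enet}} = \{j : \hat\beta_j(\mathrm{enet}) \neq 0\}$ and let
$\hat{\mathcal{A}}^c_{\mathrm{enet}}$ denote its complement set. Then, we have
$\hat\beta(\mathrm{AdaEnet})_{\hat{\mathcal{A}}^c_{\mathrm{enet}}} = 0$ and

(2.3) $\hat\beta(\mathrm{AdaEnet})_{\hat{\mathcal{A}}_{\mathrm{enet}}} = \left(1 + \frac{\lambda_2}{n}\right) \left\{ \arg\min_{\beta} \|y - X_{\hat{\mathcal{A}}_{\mathrm{enet}}}\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1^* \sum_{j \in \hat{\mathcal{A}}_{\mathrm{enet}}} \hat w_j |\beta_j| \right\}$,

where $\beta$ in (2.3) is a vector of length $|\hat{\mathcal{A}}_{\mathrm{enet}}|$, the size of $\hat{\mathcal{A}}_{\mathrm{enet}}$.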

The $\ell_1$ regularization parameters $\lambda_1^*$ and $\lambda_1$ are directly responsible for
the sparsity of the estimates. Their values are allowed to be different. On
the other hand, we use the same $\lambda_2$ for the $\ell_2$ penalty component in the
elastic-net and the adaptive elastic-net estimators, because the $\ell_2$ penalty
offers the same kind of contribution in both estimators.
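
The two-stage construction in (1.4), (2.1) and (2.2) can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a plain coordinate-descent solver (the paper does not say which algorithm was used), and the function names `soft_threshold`, `weighted_enet` and `adaptive_enet` are hypothetical. Coefficients that the elastic-net sets to zero receive an infinite weight and stay at zero, as described above.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_enet(X, y, lam2, lam1, weights, n_sweeps=500):
    """Coordinate descent (an assumed solver) for
    ||y - X b||^2 + lam2 * ||b||^2 + lam1 * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if weights[j] == float("inf"):
                continue  # infinite weight: coefficient is held at zero
            # x_j^T (residual with coordinate j added back in)
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft_threshold(2.0 * rho, lam1 * weights[j]) / (2.0 * (col_ss[j] + lam2))
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def adaptive_enet(X, y, lam2, lam1, lam1_star, gamma):
    """Stage 1: elastic-net (1.4). Stage 2: weights (2.1), then criterion (2.2)."""
    n, p = len(X), len(X[0])
    scale = 1.0 + lam2 / n
    enet = [scale * b for b in weighted_enet(X, y, lam2, lam1, [1.0] * p)]
    w = [float("inf") if b == 0.0 else abs(b) ** (-gamma) for b in enet]
    return [scale * b for b in weighted_enet(X, y, lam2, lam1_star, w)]


# Toy data with orthogonal columns; only the first predictor matters much.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [3.1, -2.9, 2.9, -3.1]
est = adaptive_enet(X, y, lam2=0.5, lam1=1.0, lam1_star=1.0, gamma=1.0)
```

Setting `lam2=0` in `adaptive_enet` gives an adaptive lasso fit with elastic-net-free weights, which mirrors the remark that (2.2) reduces to the adaptive lasso when $\lambda_2 = 0$.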

3. Statistical theory. In our theoretical analysis, we assume the following 
regularity conditions throughout: 

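(A1) We use $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ to denote the minimum and
maximum eigenvalues of a positive definite matrix $M$, respectively. Then, we
assume

$b \le \lambda_{\min}\left(\frac{1}{n}X^TX\right) \le \lambda_{\max}\left(\frac{1}{n}X^TX\right) \le B$,

where $b$ and $B$ are two positive constants.

(A2) $\lim_{n\to\infty} \frac{\max_{i=1,\ldots,n} \sum_{j=1}^p x_{ij}^2}{n} = 0$;

(A3) $E[|\varepsilon|^{2+\delta}] < \infty$ for some $\delta > 0$;

(A4) $\lim_{n\to\infty} \frac{\log(p)}{\log(n)} = \nu$, for some $0 \le \nu < 1$.

To construct the adaptive weights ($\hat w$), we take a fixed $\gamma$ such that
$\gamma > \frac{2\nu}{1-\nu}$. In our numerical studies, we let
$\gamma = \lceil \frac{2\nu}{1-\nu} \rceil + 1$ to avoid the tuning on $\gamma$.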

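Once $\gamma$ is chosen, we choose the regularization parameters according to the
following conditions:

(A5) $\lim_{n\to\infty} \frac{\lambda_2}{n} = 0$, $\lim_{n\to\infty} \frac{\lambda_1}{\sqrt{n}} = 0$

and

$\lim_{n\to\infty} \frac{\lambda_1^*}{\sqrt{n}} = 0$, $\lim_{n\to\infty} \frac{\lambda_1^*}{\sqrt{n}}\, n^{((1-\nu)(1+\gamma)-1)/2} = \infty$;

(A6) $\lim_{n\to\infty} \frac{\lambda_2}{\sqrt{n}} \sqrt{\sum_{j\in\mathcal{A}} \beta_j^{*2}} = 0$,

$\lim_{n\to\infty} \min\left( \frac{n}{\lambda_1\sqrt{p}}, \left(\frac{\sqrt{n}}{\sqrt{p}\,\lambda_1^*}\right)^{1/\gamma} \right) \left( \min_{j\in\mathcal{A}} |\beta_j^*| \right) = \infty$.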



Conditions (A1) and (A2) assume the predictor matrix has a reasonably
good behavior. Similar conditions were considered in Portnoy (1984). Note
that in the linear regression setting, condition (A1) is exactly condition (F)
in Fan and Peng (2004). Condition (A3) is used to establish the asymptotic
normality of $\hat\beta(\mathrm{AdaEnet})$.

It is worth pointing out that condition (A4) is weaker than that used in
Fan and Peng (2004), in which $p$ is assumed to satisfy $p^5/n \to 0$ or at most
$p^3/n \to 0$. It means their results require $\nu < \frac{1}{3}$. Our theory removes this
limitation. For any $0 \le \nu < 1$, we can choose an appropriate $\gamma$ to construct
the adaptive weights and the oracle property holds as long as $\gamma > \frac{2\nu}{1-\nu}$. Also
note that, in the finite dimension setting, $\nu = 0$; thus, any positive $\gamma$ can be
used, which agrees with the results in Zou (2006).
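
The rule for picking $\gamma$ from the growth rate $\nu$ is easy to mechanize. The sketch below assumes the threshold reads $\gamma > 2\nu/(1-\nu)$ with the default $\gamma = \lceil 2\nu/(1-\nu) \rceil + 1$, which is consistent with the values $\gamma = 3$ for $\nu = 1/2$ and $\gamma = 5$ for $\nu = 2/3$ used in the numerical studies; the function name `gamma_for` is hypothetical. Exact rational arithmetic is used so that the ceiling is not disturbed by floating-point rounding.

```python
from fractions import Fraction
import math


def gamma_for(nu):
    """Default exponent gamma = ceil(2*nu / (1 - nu)) + 1 for 0 <= nu < 1."""
    nu = Fraction(nu)
    if not 0 <= nu < 1:
        raise ValueError("nu must satisfy 0 <= nu < 1")
    return math.ceil(2 * nu / (1 - nu)) + 1


# nu = 0 is the finite-dimension case; 1/2 and 2/3 are the rates in the examples.
gammas = [gamma_for(nu) for nu in (Fraction(0), Fraction(1, 2), Fraction(2, 3))]
```

As $\nu$ approaches 1 the required $\gamma$ grows without bound, which is one way to read why (A4) excludes $\nu = 1$.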

Condition (A6) is similar to condition (H) in Fan and Peng (2004). Ba- 
sically, condition (A6) allows the nonzero coefficients to vanish but at a 
rate that can be distinguished by the penalized least squares. In the finite 
dimension setting, the condition is implicitly assumed. 

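Theorem 3.1. Given the data $(y, X)$, let $\hat w = (\hat w_1, \ldots, \hat w_p)$ be a vector
whose components are all nonnegative and can depend on $(y, X)$. Define

$\hat\beta_{\hat w}(\lambda_2, \lambda_1) = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2 + \lambda_1 \sum_{j=1}^p \hat w_j |\beta_j| \right\}$

for nonnegative parameters $\lambda_2$ and $\lambda_1$. If $\hat w_j = 1$ for all $j$, we denote
$\hat\beta_{\hat w}(\lambda_2, \lambda_1)$ by $\hat\beta(\lambda_2, \lambda_1)$ for convenience.

If we assume the model (1.1) and condition (A1), then

$E(\|\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \beta^*\|_2^2) \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 E(\sum_{j=1}^p \hat w_j^2)}{(bn + \lambda_2)^2}$.

In particular, when $\hat w_j = 1$ for all $j$, we have

$E(\|\hat\beta(\lambda_2, \lambda_1) - \beta^*\|_2^2) \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2}$.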

It is worth mentioning that the derived risk bounds are nonasymptotic.
Theorem 3.1 is very useful for the asymptotic analysis. A direct corollary of
Theorem 3.1 is that, under conditions (A1)-(A6), $\hat\beta(\lambda_2, \lambda_1)$ is a
root-(n/p)-consistent estimator. This rate of consistency is the same as that of
the SCAD [Fan and Peng (2004)]. The root-(n/p) consistency result suggests that
it is appropriate to use the elastic-net to construct the adaptive weights.
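
To see the root-(n/p) rate numerically, one can evaluate the unweighted risk bound directly. This sketch assumes the bound has the form $4(\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p)/(bn + \lambda_2)^2$; the function name `enet_risk_bound` and the numbers plugged in are illustrative only.

```python
def enet_risk_bound(lam2, lam1, beta_sq_norm, p, n, sigma2, b, B):
    """Nonasymptotic bound on E||beta_hat(lam2, lam1) - beta*||^2 (unit weights).

    beta_sq_norm is ||beta*||_2^2; b and B are the eigenvalue bounds of (A1).
    """
    numerator = lam2 ** 2 * beta_sq_norm + B * p * n * sigma2 + lam1 ** 2 * p
    return 4.0 * numerator / (b * n + lam2) ** 2


# With no penalty and b = B = sigma^2 = 1 the bound is 4 * p / n.
bound = enet_risk_bound(lam2=0.0, lam1=0.0, beta_sq_norm=9.0,
                        p=10, n=100, sigma2=1.0, b=1.0, B=1.0)
# Doubling both p and n leaves the leading term unchanged.
bound_scaled = enet_risk_bound(lam2=0.0, lam1=0.0, beta_sq_norm=9.0,
                               p=20, n=200, sigma2=1.0, b=1.0, B=1.0)
```

When $\lambda_2 = o(n)$ and $\lambda_1 = o(\sqrt{n})$ the penalty terms are of smaller order, so the bound behaves like a constant times $p/n$.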

Theorem 3.2. Let us write $\beta^* = (\beta^*_{\mathcal{A}}, 0)$ and define

(3.1) $\tilde\beta^*_{\mathcal{A}} = \arg\min_{\beta} \left\{ \|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda_2 \sum_{j\in\mathcal{A}} \beta_j^2 + \lambda_1^* \sum_{j\in\mathcal{A}} \hat w_j |\beta_j| \right\}$.

Then, with probability tending to 1, $((1 + \frac{\lambda_2}{n})\tilde\beta^*_{\mathcal{A}}, 0)$ is the solution to (2.2).

Theorem 3.2 provides an asymptotic characterization of the solution to the
adaptive elastic-net criterion. The definition of $\tilde\beta^*_{\mathcal{A}}$ borrows the concept of
"oracle" [Donoho and Johnstone (1994), Fan and Li (2001), Fan and Peng
(2004) and Zou (2006)]. If there were an oracle informing us of the true subset
model, then we would use this oracle information and the adaptive
elastic-net criterion would become that in (3.1). Theorem 3.2 tells us that,
asymptotically speaking, the adaptive elastic-net works as if it had such oracle
information. Theorem 3.2 also suggests that the adaptive elastic-net should
enjoy the oracle property, which is confirmed in the next theorem.

Theorem 3.3. Under conditions (A1)-(A6), the adaptive elastic-net
has the oracle property; that is, the estimator $\hat\beta(\mathrm{AdaEnet})$ must satisfy:

1. Consistency in selection: $\Pr(\{j : \hat\beta(\mathrm{AdaEnet})_j \neq 0\} = \mathcal{A}) \to 1$,

2. Asymptotic normality: $\alpha^T \frac{I + \lambda_2 \Sigma_{\mathcal{A}}^{-1}}{1 + \lambda_2/n} \Sigma_{\mathcal{A}}^{1/2} (\hat\beta(\mathrm{AdaEnet})_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2)$,
where $\Sigma_{\mathcal{A}} = X_{\mathcal{A}}^T X_{\mathcal{A}}$ and $\alpha$ is a vector of norm 1.

By Theorem 3.3, the selection consistency and the asymptotic normality
of the adaptive elastic-net are still valid when the number of parameters
diverges. Technically speaking, the selection consistency result is stronger
than what Theorem 3.2 implies, although Theorem 3.2 plays an important
role in the proof of Theorem 3.3. As a special case, when we let $\lambda_2 = 0$,
which is a choice satisfying conditions (A5) and (A6), Theorem 3.3 tells us
that the adaptive lasso enjoys the selection consistency and the asymptotic
normality

$\alpha^T \Sigma_{\mathcal{A}}^{1/2} (\hat\beta(\mathrm{AdaLasso})_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2)$.

4. Numerical studies. In this section, we present simulations to study 
the finite sample performance of the adaptive elastic-net. We considered five 
methods in the simulation study: the lasso (Lasso), the elastic-net (Enet), 
the adaptive lasso (ALasso), the adaptive elastic-net (AEnet) and the SCAD. 
In our implementation, we let $\lambda_2 = 0$ in the adaptive elastic-net to get the 
adaptive lasso fit. There are several commonly used tuning parameter selec- 
tion methods, such as cross-validation, generalized cross-validation (GCV), 
AIC and BIC. Zou, Hastie and Tibshirani (2007) suggested using BIC to se- 
lect the lasso tuning parameter. Wang, Li and Tsai (2007) showed that for 
the SCAD, BIC is a better tuning parameter selector than GCV and AIC. 
In this work, we used BIC to select the tuning parameter for each method. 

Fan and Peng (2004) considered simulation models in which $p_n = [4n^{1/4}] - 5$
and $|\mathcal{A}| = 5$. Our theory allows $p_n = O(n^{\nu})$ for any $\nu < 1$. Thus, we are
interested in models in which $p_n = O(n^{\nu})$ with $\nu > \frac{1}{3}$. In addition, we allow
the intrinsic dimension ($|\mathcal{A}|$) to diverge with the sample size as well, because
such designs make the model selection and estimation more challenging than
in the fixed $|\mathcal{A}|$ situations.

Example 1. We generated data from the linear regression model

$y = x^T\beta^* + \varepsilon$,

where $\beta^*$ is a p-dim vector and $\varepsilon \sim N(0, \sigma^2)$, $\sigma = 6$, and $x$ follows a p-dim
multivariate normal distribution with zero mean and covariance $\Sigma$ whose
$(j, k)$ entry is $\Sigma_{j,k} = \rho^{|j-k|}$, $1 \le k, j \le p$. We considered $\rho = 0.5$ and $\rho = 0.75$.
Let $p = p_n = [4n^{1/2}] - 5$ for $n = 100, 200, 400$. Let $1_m$/$0_m$ denote an m-vector
of 1's/0's. The true coefficients are $\beta^* = (3 \cdot 1_q, 3 \cdot 1_q, 3 \cdot 1_q, 0_{p-3q})^T$, so that
$|\mathcal{A}| = 3q$, where $q = [p_n/9]$. In this example $\nu = \frac{1}{2}$; hence, we used $\gamma = 3$ for
computing the adaptive weights in the adaptive elastic-net.
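
The simulation design can be written down in a few lines. This is a sketch under two assumptions: that $[\cdot]$ denotes the integer part (which reproduces the dimensions 35, 51 and 75 implied by the formula), and that the correlated predictors are drawn through an AR(1) recursion, which yields the covariance $\rho^{|j-k|}$. The function names are hypothetical and the random number generator is Python's standard one, not whatever the authors used.

```python
import math
import random


def design_sizes(n, exponent):
    """Return (p_n, q, |A|) with p_n = floor(4 * n**exponent) - 5 and q = floor(p_n / 9)."""
    p = math.floor(4 * n ** exponent) - 5
    q = p // 9
    return p, q, 3 * q


def true_beta(p, q):
    """beta* = (3, ..., 3, 0, ..., 0) with 3q leading entries equal to 3."""
    return [3.0] * (3 * q) + [0.0] * (p - 3 * q)


def draw_x(p, rho, rng):
    """One draw from N(0, Sigma), Sigma[j][k] = rho**|j-k|, via an AR(1) recursion."""
    x = [rng.gauss(0.0, 1.0)]
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(1, p):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x


def simulate(n, rho, sigma=6.0, exponent=0.5, seed=0):
    """Generate (X, y, beta*) from y = x^T beta* + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    p, q, _ = design_sizes(n, exponent)
    beta = true_beta(p, q)
    X = [draw_x(p, rho, rng) for _ in range(n)]
    y = [sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, sigma) for row in X]
    return X, y, beta


sizes = [design_sizes(n, 0.5) for n in (100, 200, 400)]
X, y, beta_star = simulate(n=100, rho=0.5)
```

Calling `design_sizes(n, 2 / 3)` gives the larger designs obtained when the exponent 1/2 is replaced by 2/3.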

For each estimator $\hat\beta$, its estimation accuracy is measured by the mean
squared error (MSE) defined as $E[(\hat\beta - \beta^*)^T \Sigma (\hat\beta - \beta^*)]$. The variable
selection performance is gauged by $(C, IC)$, where $C$ is the number of zero
coefficients that are correctly estimated by zero and $IC$ is the number of
nonzero coefficients that are incorrectly estimated by zero.
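
For a single replication, the three performance measures can be computed as follows. The sketch evaluates the quadratic form $(\hat\beta - \beta^*)^T\Sigma(\hat\beta - \beta^*)$ for one fit (the reported MSE would be its average over replications) and the counts $(C, IC)$; the helper names and the toy vectors are illustrative only.

```python
def model_error(beta_hat, beta_star, rho):
    """(b - b*)^T Sigma (b - b*) with Sigma[j][k] = rho**|j-k|."""
    d = [a - b for a, b in zip(beta_hat, beta_star)]
    p = len(d)
    return sum(d[j] * rho ** abs(j - k) * d[k] for j in range(p) for k in range(p))


def selection_counts(beta_hat, beta_star):
    """Return (C, IC): zeros correctly kept at zero, nonzeros wrongly set to zero."""
    c = sum(1 for a, b in zip(beta_hat, beta_star) if b == 0 and a == 0)
    ic = sum(1 for a, b in zip(beta_hat, beta_star) if b != 0 and a == 0)
    return c, ic


beta_star = [3.0, 3.0, 0.0, 0.0]
beta_hat = [2.5, 0.0, 0.0, 0.1]
err = model_error(beta_hat, beta_star, rho=0.5)
c, ic = selection_counts(beta_hat, beta_star)
```

In this toy case one true zero is kept at zero and one true signal is dropped, so a perfect selector would instead report $C = 2$ and $IC = 0$.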

Table 1 documents the simulation results. Several interesting observations 
can be made: 

1. When the sample size is large (n = 400), the three oracle-like estimators 
outperform the lasso and the elastic-net which do not have the oracle 
property. That is expected according to the asymptotic theory. 

2. The SCAD and the adaptive elastic-net are the best when the sample 
size is large and the correlation is moderate. However, the SCAD can 
perform much worse than the adaptive elastic-net when the correlation 
is high ($\rho = 0.75$) or the sample size is small. 
3. Both the elastic-net and the adaptive lasso can do significantly better 
than the lasso. What is more interesting is that the adaptive elastic-net 
often outperforms the elastic-net and the adaptive lasso. 

Example 2. We considered the same setup as in Example 1, except
that we let $p = p_n = [4n^{2/3}] - 5$ for $n = 100, 200, 800$. Since $\nu = \frac{2}{3}$, we used
$\gamma = 5$ for computing the adaptive weights in the adaptive elastic-net and
the adaptive lasso. The estimation problem in this example is even more
difficult than that in Example 1. To see why, note that when $n = 200$ the
dimension increases from 51 in Example 1 to 131 in this example, and the
intrinsic dimension ($|\mathcal{A}|$) is almost tripled.

The simulation results are presented in Table 2, from which we can see
that the three observations made in Example 1 are still valid in this example.
Furthermore, we see that, for every combination of $(n, p, |\mathcal{A}|, \rho)$, the adaptive
elastic-net has the best performance.

5. Ultra-high dimensional data. In this section, we discuss how the
adaptive elastic-net can be applied to ultra-high dimensional data in which $p > n$.
When $p$ is much larger than $n$, Candes and Tao (2007) suggested using the
Dantzig selector, which can achieve the ideal estimation risk up to a $\log(p)$
factor under the uniform uncertainty condition. Fan and Lv (2008) showed
that the uniform uncertainty condition may easily fail and the $\log(p)$
factor is too large when $p$ is exponentially large. Moreover, the computational
cost of the Dantzig selector would be very high when $p$ is large. In order to
overcome these difficulties, Fan and Lv (2008) introduced the Sure
Independence Screening (SIS) idea, which reduces the ultra-high dimensionality to a
relatively large scale $d_n$ with $d_n < n$. Then, lower-dimensional methods such
as the SCAD can be used to estimate the sparse model. This procedure is
referred to as SIS + SCAD. Under regularity conditions, Fan and Lv (2008)
proved that SIS misses true features with an exponentially small
probability and SIS + SCAD holds the oracle property if $d_n = o(n^{1/3})$. Furthermore,
with the help of SIS, the Dantzig selector can achieve the ideal risk up to a
$\log(d_n)$ factor, rather than the original $\log(p)$.

Inspired by the results of Fan and Lv (2008), we consider combining the 
adaptive elastic-net and SIS when p > n. We first apply SIS to reduce the 
dimension to dn and then fit the data by using the adaptive elastic-net. We 
call this procedure SIS + AEnet. 

Theorem 5.1. Suppose the conditions for Theorem 1 in Fan and Lv
(2008) hold. Let $d_n = O(n^{\nu})$, $\nu < 1$; then, SIS + AEnet produces an estimator
that holds the oracle property.

Table 1

Simulation I: model selection and fitting results based on 100 replications

  n    p_n   |A|   Model    MSE             C             IC

                         $\rho = 0.5$
 100   35     9    Truth                    26            0
                   Lasso    7.57 (0.31)     [illegible]   0.01
                   ALasso   6.78 (0.42)     25.50         0.42
                   Enet     5.91 (0.29)     24.06
                   AEnet    5.07 (0.35)     25.47         0.15
                   SCAD    10.55 (0.68)     22.54         0.35
 200   51    15    Truth                    36
                   Lasso    6.63 (0.24)     33.32
                   ALasso   3.78 (0.18)     35.46         0.02
                   Enet     4.86 (0.19)     33.36
                   AEnet    3.46 (0.17)     35.47         0.01
                   SCAD     4.76 (0.33)     34.63         0.10
 400   75    24    Truth                    51
                   Lasso    4.99 (0.15)     47.31
                   ALasso   2.76 (0.09)     50.33
                   Enet     3.37 (0.12)     48.00
                   AEnet    2.47 (0.08)     50.45
                   SCAD     2.42 (0.09)     50.88

                         $\rho = 0.75$
 100   35     9    Truth                    26            0
                   Lasso    5.93 (0.26)     [illegible]   0.14
                   ALasso   8.49 (0.39)     25.76         1.84
                   Enet     4.18 (0.24)     24.77         0.05
                   AEnet    5.24 (0.32)     25.70         0.74
                   SCAD    11.59 (0.56)     22.46         1.34
 200   51    15    Truth                    36
                   Lasso    5.10 (0.18)     34.66         0.02
                   ALasso   5.32 (0.31)     35.70         0.87
                   Enet     3.79 (0.17)     34.79
                   AEnet    3.32 (0.17)     35.80         0.19
                   SCAD     5.99 (0.31)     33.10         0.35
 400   75    24    Truth                    51
                   Lasso    3.83 (0.12)     49.03
                   ALasso   2.85 (0.12)     50.53         0.09
                   Enet     3.24 (0.11)     49.07
                   AEnet    2.71 (0.09)     50.54         0.03
                   SCAD     3.64 (0.17)     48.43         0.09

We make a note here that Theorem 5.1 is a direct consequence of The- 
orem 1 in Fan and Lv (2008) and Theorem 3.3; thus, its proof is omitted. 
Theorem 5.1 is similar to Theorem 5 in Fan and Lv (2008), but there is 

Table 2

Example 2: model selection and fitting results based on 100 replications

  n    p_n   |A|   Model    MSE                    C             IC

                         $\rho = 0.5$
 100   81    27    Truth                           54
                   Lasso   31.73 (1.06)            47.06         0.19
                   ALasso  [illegible] (1.22)      [illegible]   2.12
                   Enet    27.61 (1.04)            46.35         0.13
                   AEnet   20.27 (0.94)            53.00         1.15
                   SCAD    [illegible]             47.79         2.37
 200   131   42    Truth                           89
                   Lasso   23.41 (0.67)            80.51
                   ALasso  [illegible]             [illegible]   0.14
                   Enet    18.94 (0.61)            [illegible]   [illegible]
                   AEnet   10.68 (0.37)            87.97
                   SCAD    [illegible]             [illegible]   [illegible]
 800   339   111   Truth                           228
                   Lasso   [illegible] (0.23)      212.10
                   ALasso  [illegible] (0.12)      [illegible]   [illegible]
                   Enet    [illegible] (0.18)      [illegible]   [illegible]
                   AEnet    6.00 (0.10)            [illegible]   [illegible]
                   SCAD    [illegible] (0.30)      228.00        0.33

                         $\rho = 0.75$
 100   81    27    Truth                           54
                   Lasso   22.04 (0.73)            50.74         0.71
                   ALasso  33.98 (1.08)            53.73         7.19
                   Enet    17.37 (0.62)            50.82         0.46
                   AEnet   16.18 (0.80)            53.67         2.36
                   SCAD    31.84 (1.77)            50.55         4.74
 200   131   42    Truth                           89
                   Lasso   16.71 (0.50)            85.17         0.06
                   ALasso  20.98 (0.92)            88.64         3.98
                   Enet    14.12 (0.48)            85.35         0.05
                   AEnet   11.16 (0.46)            88.60         0.87
                   SCAD    15.27 (0.61)            87.20         1.33
 800   339   111   Truth                           228
                   Lasso   10.01 (0.16)            221.74
                   ALasso   6.39 (0.12)            226.89
                   Enet     8.01 (0.13)            222.74
                   AEnet    6.23 (0.11)            226.94
                   SCAD     6.62 (0.17)            228.00        0.29

a difference. SIS + AEnet can hold the oracle property when $d_n$ exceeds
$O(n^{1/3})$, while Theorem 5 in Fan and Lv (2008) assumes $d_n = o(n^{1/3})$.

Table 3

A demonstration of SIS + AEnet: model selection and fitting results based on 100
replications

 $d_n = [5.5n^{2/3}]$   Model         MSE           C        IC

 188                    Truth                       992
                        SIS + AEnet   0.71 (0.18)   987.45   0.05
                        SIS + SCAD    1.48 (0.90)   982.20   0.06

To demonstrate SIS + AEnet, we consider the simulation example used in
Fan and Lv (2008), Section 3.3.1. The model is $y = x^T\beta^* + 1.5N(0, 1)$, where
$\beta^* = (\beta_1^T, 0_{p-|\mathcal{A}|})^T$ with $|\mathcal{A}| = 8$. Here, $\beta_1$ is an 8-dim vector and each
component has the form $(-1)^u(a_n + |z|)$, where $a_n = 4\log(n)/\sqrt{n}$, $u$ is randomly
drawn from Ber(0.4) and $z$ is randomly drawn from the standard normal
distribution. We generated $n = 200$ observations from the above model. Before
applying the adaptive elastic-net, we used SIS to reduce the dimensionality
from 1000 to $d_n = [5.5n^{2/3}] = 188$. The estimation problem is still rather
challenging, as we need to estimate 188 parameters by using only 200
observations. From Table 3, we see that SIS + AEnet performs favorably compared
to SIS + SCAD.
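
The screening step of SIS + AEnet can be sketched as follows. This is an illustration of componentwise screening as commonly described for SIS, namely ranking standardized predictors by the absolute value of their inner product with the centered response and keeping the top $d_n$; it is an assumption that this matches the exact variant used in the experiment, and the names `sis_size` and `sis_screen` are hypothetical.

```python
import math


def sis_size(n, c=5.5, exponent=2.0 / 3.0):
    """Screened dimension d_n = floor(c * n**exponent)."""
    return math.floor(c * n ** exponent)


def sis_screen(X, y, d):
    """Indices of the d predictors with the largest marginal |x_j^T y| score.

    Columns are centered and scaled to unit standard deviation first;
    constant columns get a score of zero.
    """
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    scores = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / n)
        if sd == 0.0:
            scores.append(0.0)
            continue
        scores.append(abs(sum((v - m) / sd * (t - ybar) for v, t in zip(col, y))))
    order = sorted(range(p), key=lambda j: -scores[j])
    return sorted(order[:d])


# Toy check: the response depends on column 0 only.
X = [[1, 0, 1], [2, 1, 0], [3, 0, 0], [4, 1, 1]]
y = [2, 4, 6, 8]
kept_one = sis_screen(X, y, 1)
kept_two = sis_screen(X, y, 2)
d200 = sis_size(200)
```

The adaptive elastic-net would then be fitted on the columns listed in the returned index set, with the remaining coefficients set to zero.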

6. Proofs. 

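Proof of Theorem 3.1. We write

$\hat\beta(\lambda_2, 0) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2$.

By the definition of $\hat\beta_{\hat w}(\lambda_2, \lambda_1)$ and $\hat\beta(\lambda_2, 0)$, we know

$\|y - X\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 + \lambda_2\|\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 \ge \|y - X\hat\beta(\lambda_2, 0)\|_2^2 + \lambda_2\|\hat\beta(\lambda_2, 0)\|_2^2$

and

$\|y - X\hat\beta(\lambda_2, 0)\|_2^2 + \lambda_2\|\hat\beta(\lambda_2, 0)\|_2^2 + \lambda_1 \sum_{j=1}^p \hat w_j |\hat\beta(\lambda_2, 0)_j|$
$\quad \ge \|y - X\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 + \lambda_2\|\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 + \lambda_1 \sum_{j=1}^p \hat w_j |\hat\beta_{\hat w}(\lambda_2, \lambda_1)_j|$.

From the above two inequalities, we have

(6.1) $\lambda_1 \sum_{j=1}^p \hat w_j (|\hat\beta(\lambda_2, 0)_j| - |\hat\beta_{\hat w}(\lambda_2, \lambda_1)_j|)$
$\quad \ge (\|y - X\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 + \lambda_2\|\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2) - (\|y - X\hat\beta(\lambda_2, 0)\|_2^2 + \lambda_2\|\hat\beta(\lambda_2, 0)\|_2^2)$.

On the other hand, we have

$(\|y - X\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2 + \lambda_2\|\hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2^2) - (\|y - X\hat\beta(\lambda_2, 0)\|_2^2 + \lambda_2\|\hat\beta(\lambda_2, 0)\|_2^2)$
$\quad = (\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0))^T (X^TX + \lambda_2 I)(\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0))$

and

$\sum_{j=1}^p \hat w_j (|\hat\beta(\lambda_2, 0)_j| - |\hat\beta_{\hat w}(\lambda_2, \lambda_1)_j|) \le \sum_{j=1}^p \hat w_j |\hat\beta(\lambda_2, 0)_j - \hat\beta_{\hat w}(\lambda_2, \lambda_1)_j| \le \sqrt{\sum_{j=1}^p \hat w_j^2}\, \|\hat\beta(\lambda_2, 0) - \hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2$.

Note that $\lambda_{\min}(X^TX + \lambda_2 I) = \lambda_{\min}(X^TX) + \lambda_2$. Therefore, we end up with

(6.2) $(\lambda_{\min}(X^TX) + \lambda_2)\|\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0)\|_2^2$
$\quad \le (\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0))^T (X^TX + \lambda_2 I)(\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0))$
$\quad \le \lambda_1 \sqrt{\sum_{j=1}^p \hat w_j^2}\, \|\hat\beta(\lambda_2, 0) - \hat\beta_{\hat w}(\lambda_2, \lambda_1)\|_2$,

which results in the inequality

(6.3) $\|\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0)\|_2 \le \frac{\lambda_1 \sqrt{\sum_{j=1}^p \hat w_j^2}}{\lambda_{\min}(X^TX) + \lambda_2}$.

Note that

$\hat\beta(\lambda_2, 0) - \beta^* = -\lambda_2 (X^TX + \lambda_2 I)^{-1}\beta^* + (X^TX + \lambda_2 I)^{-1}X^T\varepsilon$,

which implies that

(6.4) $E(\|\hat\beta(\lambda_2, 0) - \beta^*\|_2^2)$
$\quad \le 2\lambda_2^2 \|(X^TX + \lambda_2 I)^{-1}\beta^*\|_2^2 + 2E(\|(X^TX + \lambda_2 I)^{-1}X^T\varepsilon\|_2^2)$
$\quad \le 2\lambda_2^2 (\lambda_{\min}(X^TX) + \lambda_2)^{-2}\|\beta^*\|_2^2 + 2(\lambda_{\min}(X^TX) + \lambda_2)^{-2} E(\varepsilon^T XX^T \varepsilon)$
$\quad = 2(\lambda_{\min}(X^TX) + \lambda_2)^{-2} (\lambda_2^2\|\beta^*\|_2^2 + \mathrm{Tr}(X^TX)\sigma^2)$
$\quad \le 2(\lambda_{\min}(X^TX) + \lambda_2)^{-2} (\lambda_2^2\|\beta^*\|_2^2 + p\lambda_{\max}(X^TX)\sigma^2)$.

Combining (6.3) and (6.4), we have

$E(\|\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \beta^*\|_2^2) \le 2E(\|\hat\beta(\lambda_2, 0) - \beta^*\|_2^2) + 2E(\|\hat\beta_{\hat w}(\lambda_2, \lambda_1) - \hat\beta(\lambda_2, 0)\|_2^2)$

(6.5) $\quad \le \frac{4\lambda_2^2\|\beta^*\|_2^2 + 4p\lambda_{\max}(X^TX)\sigma^2 + 2\lambda_1^2 E(\sum_{j=1}^p \hat w_j^2)}{(\lambda_{\min}(X^TX) + \lambda_2)^2}$

(6.6) $\quad \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + pBn\sigma^2 + \lambda_1^2 E(\sum_{j=1}^p \hat w_j^2)}{(bn + \lambda_2)^2}$.

We have used condition (A1) in the last inequality. When $\hat w_j = 1$ for all $j$,
we have

$E(\|\hat\beta(\lambda_2, \lambda_1) - \beta^*\|_2^2) \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + pBn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2}$. □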

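Proof of Theorem 3.2. We show that $((1 + \frac{\lambda_2}{n})\tilde\beta^*_{\mathcal{A}}, 0)$ satisfies the
Karush-Kuhn-Tucker (KKT) conditions of (2.2) with probability tending
to 1. By the definition of $\tilde\beta^*_{\mathcal{A}}$, it suffices to show

$\Pr(\forall j \in \mathcal{A}^c,\ |-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| \le \lambda_1^* \hat w_j) \to 1$

or, equivalently,

$\Pr(\exists j \in \mathcal{A}^c,\ |-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j) \to 0$.

Let $\eta = \min_{j\in\mathcal{A}}(|\beta_j^*|)$ and $\hat\eta = \min_{j\in\mathcal{A}}(|\hat\beta(\mathrm{enet})_j|)$. We note that

$\Pr(\exists j \in \mathcal{A}^c,\ |-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j)$
$\quad \le \sum_{j\in\mathcal{A}^c} \Pr(|-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2) + \Pr(\hat\eta \le \eta/2)$,

$\Pr(\hat\eta \le \eta/2) \le \Pr(\|\hat\beta(\mathrm{enet}) - \beta^*\|_2 \ge \eta/2) \le \frac{E(\|\hat\beta(\mathrm{enet}) - \beta^*\|_2^2)}{\eta^2/4}$.

Then, by Theorem 3.1, we obtain

(6.7) $\Pr(\hat\eta \le \eta/2) \le 16\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2 \eta^2}$.

Moreover, let $M = (\frac{\lambda_1^*}{n})^{1/(1+\gamma)}$, and we have

$\sum_{j\in\mathcal{A}^c} \Pr(|-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2)$
$\quad \le \sum_{j\in\mathcal{A}^c} \Pr(|-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j,\ \hat\eta > \eta/2,\ |\hat\beta(\mathrm{enet})_j| \le M) + \sum_{j\in\mathcal{A}^c} \Pr(|\hat\beta(\mathrm{enet})_j| > M)$
$\quad \le \sum_{j\in\mathcal{A}^c} \Pr(|-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* M^{-\gamma},\ \hat\eta > \eta/2) + \sum_{j\in\mathcal{A}^c} \Pr(|\hat\beta(\mathrm{enet})_j| > M)$

(6.8) $\quad \le \frac{4M^{2\gamma}}{\lambda_1^{*2}} E\left( \sum_{j\in\mathcal{A}^c} |x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 I(\hat\eta > \eta/2) \right) + \frac{1}{M^2} E\left( \sum_{j\in\mathcal{A}^c} |\hat\beta(\mathrm{enet})_j|^2 \right)$
$\quad \le \frac{4M^{2\gamma}}{\lambda_1^{*2}} E\left( \sum_{j\in\mathcal{A}^c} |x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 I(\hat\eta > \eta/2) \right) + \frac{E(\|\hat\beta(\mathrm{enet}) - \beta^*\|_2^2)}{M^2}$
$\quad \le \frac{4M^{2\gamma}}{\lambda_1^{*2}} E\left( \sum_{j\in\mathcal{A}^c} |x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 I(\hat\eta > \eta/2) \right) + 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2 M^2}$,

where we have used Theorem 3.1 in the last step. By the model assumption,
we have

$\sum_{j\in\mathcal{A}^c} |x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 = \sum_{j\in\mathcal{A}^c} |x_j^T(X_{\mathcal{A}}\beta^*_{\mathcal{A}} - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}}) + x_j^T\varepsilon|^2$
$\quad \le 2\sum_{j\in\mathcal{A}^c} |x_j^T(X_{\mathcal{A}}\beta^*_{\mathcal{A}} - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 + 2\sum_{j\in\mathcal{A}^c} |x_j^T\varepsilon|^2$
$\quad \le 2Bn \cdot Bn \|\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}\|_2^2 + 2\sum_{j\in\mathcal{A}^c} |x_j^T\varepsilon|^2$,

which gives us the inequality

(6.9) $E\left( \sum_{j\in\mathcal{A}^c} |x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})|^2 I(\hat\eta > \eta/2) \right) \le 2B^2n^2 E(\|\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}\|_2^2 I(\hat\eta > \eta/2)) + 2Bnp\sigma^2$.

We now bound $E(\|\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}\|_2^2 I(\hat\eta > \eta/2))$. Let

$\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0) = \arg\min_{\beta} \left\{ \|y - X_{\mathcal{A}}\beta\|_2^2 + \lambda_2 \sum_{j\in\mathcal{A}} \beta_j^2 \right\}$.

Then, by using the same arguments for deriving (6.1), (6.2) and (6.3), we
have

(6.10) $\|\tilde\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}(\lambda_2, 0)\|_2 \le \frac{\lambda_1^* \sqrt{\sum_{j\in\mathcal{A}} \hat w_j^2}}{\lambda_{\min}(X_{\mathcal{A}}^TX_{\mathcal{A}}) + \lambda_2} \le \frac{\lambda_1^* \hat\eta^{-\gamma}\sqrt{p}}{bn + \lambda_2}$.

Note that $\lambda_{\min}(X_{\mathcal{A}}^TX_{\mathcal{A}}) \ge \lambda_{\min}(X^TX) \ge bn$ and $\lambda_{\max}(X_{\mathcal{A}}^TX_{\mathcal{A}}) \le \lambda_{\max}(X^TX) \le Bn$.
Following the rest of the arguments in the proof of Theorem 3.1, we obtain

(6.11) $E(\|\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}\|_2^2 I(\hat\eta > \eta/2))$
$\quad \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + \lambda_{\max}(X_{\mathcal{A}}^TX_{\mathcal{A}})|\mathcal{A}|\sigma^2 + \lambda_1^{*2}(\eta/2)^{-2\gamma}|\mathcal{A}|}{(\lambda_{\min}(X_{\mathcal{A}}^TX_{\mathcal{A}}) + \lambda_2)^2}$
$\quad \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^{*2}(\eta/2)^{-2\gamma}p}{(bn + \lambda_2)^2}$.

The combination of (6.7), (6.8), (6.9) and (6.11) yields

$\Pr(\exists j \in \mathcal{A}^c,\ |-2x_j^T(y - X_{\mathcal{A}}\tilde\beta^*_{\mathcal{A}})| > \lambda_1^* \hat w_j)$
$\quad \le \frac{4M^{2\gamma}n}{\lambda_1^{*2}} \left( 8B^2n\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^{*2}(\eta/2)^{-2\gamma}p}{(bn + \lambda_2)^2} + 2Bp\sigma^2 \right)$
$\qquad + \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2} \cdot \frac{4}{M^2} + \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2} \cdot \frac{16}{\eta^2}$
$\quad \equiv K_1 + K_2 + K_3$.

We have chosen $\gamma > \frac{2\nu}{1-\nu}$; then, under conditions (A1)-(A6), it follows that

$K_1 = O\left( \frac{p}{n} \left(\frac{n}{\lambda_1^*}\right)^{2/(1+\gamma)} \right) \to 0$,

$K_2 = O\left( \frac{p}{n} \left(\frac{n}{\lambda_1^*}\right)^{2/(1+\gamma)} \right) \to 0$,

(6.12) $K_3 = O\left( \frac{p}{n} \frac{1}{\eta^2} \right) = O\left( \left( \frac{\lambda_1^*\sqrt{p}}{\sqrt{n}} \frac{1}{\eta^{\gamma}} \right)^{2/\gamma} \left( \frac{p}{n} \left(\frac{n}{\lambda_1^*}\right)^{2/(1+\gamma)} \right)^{(1+\gamma)/\gamma} \right) \to 0$.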

Thus, the proof is complete. □ 
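
Proof of Theorem 3.3. From Theorem 3.2, we have shown that,
with probability tending to 1, the adaptive elastic-net estimator is equal to
$((1 + \frac{\lambda_2}{n})\tilde\beta^*_{\mathcal{A}}, 0)$. Therefore, in order to prove the model selection consistency
result, we only need to show $\Pr(\min_{j\in\mathcal{A}} |\tilde\beta^*_j| > 0) \to 1$. By (6.10), we have

$\min_{j\in\mathcal{A}} |\tilde\beta^*_j| > \min_{j\in\mathcal{A}} |\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0)_j| - \frac{\lambda_1^*\sqrt{p}\,\hat\eta^{-\gamma}}{bn + \lambda_2}$.

Note that

$\min_{j\in\mathcal{A}} |\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0)_j| > \min_{j\in\mathcal{A}} |\beta^*_j| - \|\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0) - \beta^*_{\mathcal{A}}\|_2$.

Following (6.6), it is easy to see that

$E(\|\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0) - \beta^*_{\mathcal{A}}\|_2^2) \le 4\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2}{(bn + \lambda_2)^2} = O\left(\frac{p}{n}\right)$.

Moreover, $\frac{\lambda_1^*\sqrt{p}\,\hat\eta^{-\gamma}}{bn + \lambda_2} = O\left(\frac{1}{\sqrt{n}}\right) \left( \frac{\lambda_1^*\sqrt{p}}{\sqrt{n}}\, \eta^{-\gamma} \right) \left( \frac{\hat\eta}{\eta} \right)^{-\gamma}$ and

$E\left( \left(\frac{\hat\eta}{\eta}\right)^2 \right) \le 2 + \frac{2}{\eta^2} E((\hat\eta - \eta)^2)$
$\quad \le 2 + \frac{2}{\eta^2} E(\|\hat\beta(\lambda_2, \lambda_1) - \beta^*\|_2^2)$
$\quad \le 2 + \frac{8}{\eta^2}\, \frac{\lambda_2^2\|\beta^*\|_2^2 + Bpn\sigma^2 + \lambda_1^2 p}{(bn + \lambda_2)^2}$.

In (6.12) we have shown $\eta^2 \frac{n}{p} \to \infty$. Thus,

(6.13) $\frac{\lambda_1^*\sqrt{p}\,\hat\eta^{-\gamma}}{bn + \lambda_2} = o\left(\frac{1}{\sqrt{n}}\right) O_P(1)$.

Hence, we have

$\min_{j\in\mathcal{A}} |\tilde\beta^*_j| > \eta - \sqrt{\frac{p}{n}}\, O_P(1) - o\left(\frac{1}{\sqrt{n}}\right) O_P(1)$

and $\Pr(\min_{j\in\mathcal{A}} |\tilde\beta^*_j| > 0) \to 1$.

We now prove the asymptotic normality. For convenience, we write

$z_n = \alpha^T \frac{I + \lambda_2\Sigma_{\mathcal{A}}^{-1}}{1 + \lambda_2/n} \Sigma_{\mathcal{A}}^{1/2} (\hat\beta(\mathrm{AdaEnet})_{\mathcal{A}} - \beta^*_{\mathcal{A}})$.

Note that

$\alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} \left( \tilde\beta^*_{\mathcal{A}} - \frac{\beta^*_{\mathcal{A}}}{1 + \lambda_2/n} \right)$
$\quad = \alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} (\tilde\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}(\lambda_2, 0)) + \alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} \beta^*_{\mathcal{A}}\, \frac{\lambda_2}{n + \lambda_2}$
$\qquad + \alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} (\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0) - \beta^*_{\mathcal{A}})$.

In addition, we have

$(I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} (\tilde\beta^*_{\mathcal{A}}(\lambda_2, 0) - \beta^*_{\mathcal{A}}) = -\lambda_2\Sigma_{\mathcal{A}}^{-1/2}\beta^*_{\mathcal{A}} + \Sigma_{\mathcal{A}}^{-1/2}X_{\mathcal{A}}^T\varepsilon$.

Therefore, by Theorem 3.2, it follows that, with probability tending to 1,
$z_n = T_1 + T_2 + T_3$, where

$T_1 = \alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} \beta^*_{\mathcal{A}}\, \frac{\lambda_2}{n + \lambda_2} - \lambda_2 \alpha^T \Sigma_{\mathcal{A}}^{-1/2}\beta^*_{\mathcal{A}}$,

$T_2 = \alpha^T (I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2} (\tilde\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}(\lambda_2, 0))$,

$T_3 = \alpha^T \Sigma_{\mathcal{A}}^{-1/2}X_{\mathcal{A}}^T\varepsilon$.

We now show that $T_1 = o(1)$, $T_2 = o_P(1)$ and $T_3 \to N(0, \sigma^2)$ in distribution.
Then, by Slutsky's theorem, we know $z_n \to_d N(0, \sigma^2)$. By (A1) and $\alpha^T\alpha = 1$,
we have

$T_1^2 \le 2\|(I + \lambda_2\Sigma_{\mathcal{A}}^{-1})\Sigma_{\mathcal{A}}^{1/2}\beta^*_{\mathcal{A}}\|_2^2 \left(\frac{\lambda_2}{n + \lambda_2}\right)^2 + 2\|\lambda_2\Sigma_{\mathcal{A}}^{-1/2}\beta^*_{\mathcal{A}}\|_2^2$
$\quad \le 2\, \frac{\lambda_2^2}{(n + \lambda_2)^2}\, Bn \left(1 + \frac{\lambda_2}{bn}\right)^2 \|\beta^*_{\mathcal{A}}\|_2^2 + 2\lambda_2^2\, \frac{1}{bn}\, \|\beta^*_{\mathcal{A}}\|_2^2$.

Hence, it follows by (A6) that $T_1 = o(1)$. Similarly, we can bound $T_2$ as
follows:

$T_2^2 \le \left(1 + \frac{\lambda_2}{bn}\right)^2 \|\Sigma_{\mathcal{A}}^{1/2}(\tilde\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}(\lambda_2, 0))\|_2^2$
$\quad \le \left(1 + \frac{\lambda_2}{bn}\right)^2 Bn \|\tilde\beta^*_{\mathcal{A}} - \tilde\beta^*_{\mathcal{A}}(\lambda_2, 0)\|_2^2$
$\quad \le \left(1 + \frac{\lambda_2}{bn}\right)^2 Bn \left( \frac{\lambda_1^*\sqrt{p}\,\hat\eta^{-\gamma}}{bn + \lambda_2} \right)^2$,

where we have used (6.10) in the last step. Then, (6.13) tells us that
$T_2^2 = o(1)O_P(1)$. Next, we consider $T_3$. Let $X_{\mathcal{A}}[i,]$ denote the $i$th row of the
matrix $X_{\mathcal{A}}$. With such notation, we can write $T_3 = \sum_{i=1}^n r_i\varepsilon_i$, where
$r_i = \alpha^T (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}(X_{\mathcal{A}}[i,])^T$. Then, it is easy to see that

(6.14) $\sum_{i=1}^n r_i^2 = \sum_{i=1}^n \alpha^T (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}(X_{\mathcal{A}}[i,])^T (X_{\mathcal{A}}[i,]) (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}\alpha$
$\quad = \alpha^T (X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}(X_{\mathcal{A}}^TX_{\mathcal{A}})(X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}\alpha = \alpha^T\alpha = 1$.

Furthermore, we have for $k = 2 + \delta$, $\delta > 0$,

$\sum_{i=1}^n E[|\varepsilon_i|^{2+\delta}]\, |r_i|^{2+\delta} \le E[|\varepsilon|^{2+\delta}] \left( \sum_{i=1}^n r_i^2 \right) \left( \max_i |r_i|^2 \right)^{\delta/2} = E[|\varepsilon|^{2+\delta}] \left( \max_i |r_i|^2 \right)^{\delta/2}$.

Note that $r_i^2 \le \|(X_{\mathcal{A}}^TX_{\mathcal{A}})^{-1/2}(X_{\mathcal{A}}[i,])^T\|_2^2 \le \left(\sum_{j\in\mathcal{A}} x_{ij}^2\right) \lambda_{\max}(\Sigma_{\mathcal{A}}^{-1}) \le \frac{\sum_{j=1}^p x_{ij}^2}{bn}$. Hence,

(6.15) $\sum_{i=1}^n E[|\varepsilon_i|^{2+\delta}]\, |r_i|^{2+\delta} \le E[|\varepsilon|^{2+\delta}] \left( \frac{\max_i \sum_{j=1}^p x_{ij}^2}{bn} \right)^{\delta/2} \to 0$.

From (6.14) and (6.15), Lyapunov conditions for the central limit theorem
are established. Thus, $T_3 \to_d N(0, \sigma^2)$. This completes the proof. □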
Acknowledgments. We sincerely thank an associate editor and referees 